Prototype extraction and adaptive OCR
Internal identifier: 002033 (Main/Exploration); previous: 002032; next: 002034
Authors: Y. Xu [United States]; George Nagy (computer scientist) [United States]
Source:
- IEEE Transactions on Pattern Analysis and Machine Intelligence [0162-8828]; 1999.
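French descriptors
- Pascal (Inist): Théorie, Extraction caractéristique, Qualité image, Segmentation image, Algorithme, Concordance forme, Programmation dynamique, Analyse image, Reconnaissance optique caractère, Expérience
English descriptors
- KwdEn: Adaptive classification, Algorithms, Dynamic programming, Experiments, Feature extraction, Image analysis, Image quality, Image segmentation, Optical character recognition, Pattern matching, Template matching, Text reader, Theory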
Abstract
To maintain OCR accuracy with decreasing quality of page image composition, production, and digitization, it is essential to tune the system to each document. We propose a prototype extraction method for document-specific OCR systems. The method automatically generates training samples from unsegmented text images and the corresponding transcripts. It is tolerant of transcription errors, so a transcript produced automatically by an imperfect omnifont OCR system can be used. The method is based on new algorithms for estimating character widths, character locations in a word, and match/nonmatch probabilities from unsegmented text. An experimental word recognition system is designed and developed to combine prototype extraction algorithms and segmentation-free word recognition. The system can adapt itself to different page images and achieve high recognition accuracy on heavily degraded print.
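The abstract mentions estimating character widths from unsegmented text and its transcript. As an illustration only, and not necessarily the authors' algorithm, one simple way to approach that step is a least-squares fit: assume the pixel width of each word image is known, model it as the sum of the widths of the characters in its transcript (ignoring kerning and inter-character gaps), and solve the resulting linear system. The function name `estimate_char_widths` and the sample words and widths below are hypothetical.

```python
from collections import Counter


def estimate_char_widths(words, widths):
    """Least-squares estimate of per-character widths from whole-word widths.

    Assumption (not taken from the paper): each word width equals the sum of
    the widths of its characters. The normal equations (A^T A) x = A^T b are
    built from character counts and solved by Gauss-Jordan elimination.
    """
    alphabet = sorted(set("".join(words)))
    index = {ch: i for i, ch in enumerate(alphabet)}
    n = len(alphabet)

    # Augmented matrix [A^T A | A^T b], accumulated one word at a time.
    aug = [[0.0] * (n + 1) for _ in range(n)]
    for word, width in zip(words, widths):
        counts = Counter(word)
        for a, count_a in counts.items():
            row = aug[index[a]]
            row[n] += count_a * width
            for b, count_b in counts.items():
                row[index[b]] += count_a * count_b

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-9:
            raise ValueError("word list does not determine every character width")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]

    return {ch: aug[i][n] / aug[i][i] for ch, i in index.items()}


# Hypothetical transcript words and measured word widths in pixels.
words = ["ab", "bc", "ca", "abc", "aab"]
widths = [12, 11, 9, 16, 17]
estimated = estimate_char_widths(words, widths)
for ch in sorted(estimated):
    print(ch, round(estimated[ch], 2))
```

Because the fit minimizes squared error over all words, a few wrong transcript words would shift the estimates only slightly when many words are available, which is in the spirit of the tolerance to transcription errors described in the abstract. A real system would also need the character locations and match/nonmatch probabilities that the abstract lists; those are not sketched here.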
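Affiliations:
- Y. Xu: Hewlett-Packard Lab, Palo Alto CA, USA
- George Nagy: Rensselaer Polytechnic Institute, Troy (New York), United States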
Links toward previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000781
- to stream PascalFrancis, to step Curation: 000013
- to stream PascalFrancis, to step Checkpoint: 000763
- to stream Main, to step Merge: 002144
- to stream Main, to step Curation: 002033
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Prototype extraction and adaptive OCR</title>
<author><name sortKey="Xu, Y" sort="Xu, Y" uniqKey="Xu Y" first="Y." last="Xu">Y. Xu</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Hewlett-Packard Lab</s1>
<s2>Palo Alto CA</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Hewlett-Packard Lab</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Nagy, G" sort="Nagy, G" uniqKey="Nagy G" first="G." last="Nagy">George Nagy (informaticien)</name>
<affiliation><country>États-Unis</country>
<placeName><settlement type="city">Troy (New York)</settlement>
<region type="state">État de New York</region>
</placeName>
<orgName type="lab" n="5">Institut polytechnique Rensselaer</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">00-0247935</idno>
<date when="1999">1999</date>
<idno type="stanalyst">PASCAL 00-0247935 EI</idno>
<idno type="RBID">Pascal:00-0247935</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000781</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000013</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000763</idno>
<idno type="wicri:doubleKey">0162-8828:1999:Xu Y:prototype:extraction:and</idno>
<idno type="wicri:Area/Main/Merge">002144</idno>
<idno type="wicri:Area/Main/Curation">002033</idno>
<idno type="wicri:Area/Main/Exploration">002033</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Prototype extraction and adaptive OCR</title>
<author><name sortKey="Xu, Y" sort="Xu, Y" uniqKey="Xu Y" first="Y." last="Xu">Y. Xu</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Hewlett-Packard Lab</s1>
<s2>Palo Alto CA</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Hewlett-Packard Lab</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Nagy, G" sort="Nagy, G" uniqKey="Nagy G" first="G." last="Nagy">George Nagy (informaticien)</name>
<affiliation><country>États-Unis</country>
<placeName><settlement type="city">Troy (New York)</settlement>
<region type="state">État de New York</region>
</placeName>
<orgName type="lab" n="5">Institut polytechnique Rensselaer</orgName>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
<title level="j" type="abbreviated">IEEE Trans Pattern Anal Mach Intell</title>
<idno type="ISSN">0162-8828</idno>
<imprint><date when="1999">1999</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
<title level="j" type="abbreviated">IEEE Trans Pattern Anal Mach Intell</title>
<idno type="ISSN">0162-8828</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Adaptive classification</term>
<term>Algorithms</term>
<term>Dynamic programming</term>
<term>Experiments</term>
<term>Feature extraction</term>
<term>Image analysis</term>
<term>Image quality</term>
<term>Image segmentation</term>
<term>Optical character recognition</term>
<term>Pattern matching</term>
<term>Template matching</term>
<term>Text reader</term>
<term>Theory</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Théorie</term>
<term>Extraction caractéristique</term>
<term>Qualité image</term>
<term>Segmentation image</term>
<term>Algorithme</term>
<term>Concordance forme</term>
<term>Programmation dynamique</term>
<term>Analyse image</term>
<term>Reconnaissance optique caractère</term>
<term>Expérience</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">To maintain OCR accuracy with decreasing quality of page image composition, production, and digitization, it is essential to tune the system to each document. We propose a prototype extraction method for document-specific OCR systems. The method automatically generates training samples from unsegmented text images and the corresponding transcripts. It is tolerant of transcription errors, so a transcript produced automatically by an imperfect omnifont OCR system can be used. The method is based on new algorithms for estimating character widths, character locations in a word, and match/nonmatch probabilities from unsegmented text. An experimental word recognition system is designed and developed to combine prototype extraction algorithms and segmentation-free word recognition. The system can adapt itself to different page images and achieve high recognition accuracy on heavily degraded print.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>État de New York</li>
</region>
<settlement><li>Troy (New York)</li>
</settlement>
<orgName><li>Institut polytechnique Rensselaer</li>
</orgName>
</list>
<tree><country name="États-Unis"><noRegion><name sortKey="Xu, Y" sort="Xu, Y" uniqKey="Xu Y" first="Y." last="Xu">Y. Xu</name>
</noRegion>
<name sortKey="Nagy, G" sort="Nagy, G" uniqKey="Nagy G" first="G." last="Nagy">George Nagy (informaticien)</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002033 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 002033 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= Pascal:00-0247935 |texte= Prototype extraction and adaptive OCR }}
This area was generated with Dilib version V0.6.32.